I. Introduction

As the housing market is highly reliant on price predictions for both sellers and buyers, an accurate prediction model is essential. Zillow is an American real estate company with a large database of homes for sale and rent. However, because its data span the entire United States, it is hard for Zillow to make precise housing price predictions for a specific city. The purpose of this project is to help Zillow build a housing price prediction model for Philadelphia based on local characteristics. The model would provide a strong guide for sellers and buyers to make decisions so that they do not overpay. It would also reduce credibility risks for Zillow and enhance market efficiency.

The overall modeling strategy for the project is multivariable Ordinary Least Squares (OLS) regression, a popular statistical method for examining the relationship between a dependent variable and the predictors that might affect its value. The method provides the strength and direction of each relationship and the goodness of the model fit. In our case, there are several challenges to overcome. First, we need to select the most significant and diverse variables for the model and find data from reliable sources. Second, we need to provide strong evidence and clear visualizations of plots and maps with comprehensive interpretations for the audience. Third, we should state the limitations of the OLS regression results.

Our results indicate that using local data can enhance the accuracy of housing price predictions for Philadelphia, but because errors remain between the predicted and actual prices, further improvements are needed, and the model's predictions should be used only as a consulting reference. Changes in the housing market will also affect future housing prices, and unpredictable real-world factors, such as the Covid-19 pandemic, can impact housing prices negatively.

II. Data Wrangling

To decide on the specific variables that might affect housing prices, we start with three categories: interior conditions, neighborhood environment, and demographic characteristics. According to Zillow’s website, the living area, house age, and utilities are listed in the overview section of each house’s information, and these are among the top factors people consider when buying a house. Therefore, we chose the columns “total_area”, “year_built”, and “quality_grade” from the original dataset to represent these three factors. For the environmental factors, people normally consider safety and the convenience of public resources, so we explored the OpenDataPhilly website and used data on shooting victims, commercial corridors, and schools. For the demographic characteristics, we retrieved 2021 census data for Philadelphia from the U.S. Census Bureau, which contain the median household income, the percentage of foreign-born residents, and geographic mobility (moves to a different house).

2.1 Import packages and set options

# Load Libraries
library(geos)
library(rsgeo)
library(tidyverse)
library(tidycensus)
library(sf)
library(kableExtra)
library(knitr)
library(dplyr)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(stargazer)
library(ggplot2)


options(scipen=999)
options(tigris_class = "sf")

source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

2.2 Import and clean the general dataset

Since our project is heavily focused on geographic features, we retrieved the block group and neighborhood boundaries from OpenDataPhilly, and we filtered the housing price raw data down to the variables we need.

To apply environmental effects to housing prices, we use two different approaches. For the shooting data, we counted the victims who suffered a gunshot wound to the head within 1,000 feet of each house, because head wounds are the most likely to be fatal. Also, because the location and time of a crime are effectively random, a buffer is an appropriate way to measure safety levels in a community: even a fatal crime at a relatively far distance can cause panic and affect housing prices. Therefore, we chose a buffer sum of crimes rather than the average nearest-neighbor distance.
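The buffer computation itself is not shown in this report; a minimal sketch using sf, assuming the shooting point layer PhillyShootingHead.sf shares the same projected, feet-based CRS as Philly_price_clean, might look like:

## Count shooting victims within a 1,000-foot buffer of each house
## (a sketch; assumes both layers use a projected CRS measured in feet)
house_buffers <- sf::st_buffer(Philly_price_clean, dist = 1000)
hits <- sf::st_intersects(house_buffers, PhillyShootingHead.sf)
Philly_price_clean$shooting.Buffer <- lengths(hits)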

The second approach applies to schools, commercial corridors, and public parks and recreation centers. Here we calculated Euclidean distances, because proximity to these public resources matters most: it represents the convenience of living in a house. The variables carried into further analysis are the distances from each house to its nearest public amenity.

nearest_school <- sf::st_nearest_feature(Philly_price_clean, schools.sf)
x <- rsgeo::as_rsgeo(Philly_price_clean)
y <- rsgeo::as_rsgeo(schools.sf)

Philly_price_clean$dist_to_school <- rsgeo::distance_euclidean_pairwise(x, y[nearest_school])

nearest_Commercial <- sf::st_nearest_feature(Philly_price_clean, Commercial_Corridors)
x <- rsgeo::as_rsgeo(Philly_price_clean)
y <- rsgeo::as_rsgeo(Commercial_Corridors)
Philly_price_clean$dist_to_commerce <- rsgeo::distance_euclidean_pairwise(x, y[nearest_Commercial])

nearest_PPR <- sf::st_nearest_feature(Philly_price_clean, PPR_Sites)
x <- rsgeo::as_rsgeo(Philly_price_clean)
y <- rsgeo::as_rsgeo(PPR_Sites)
Philly_price_clean$dist_to_PPR <- rsgeo::distance_euclidean_pairwise(x, y[nearest_PPR])

For the census variables, we simply joined the data for each block group to the houses within that block group. We also assigned a neighborhood to each house for future use.

Philly_Housing_joined <-
  st_join(Philly_price_clean, blockgroup21, join = st_within)
Philly_Housing_joined <-
  st_join(Philly_Housing_joined, neighborhood, join = st_within)

Summary statistics with variable descriptions

Table 2.1 gives a comprehensive summary of the independent variables that might affect housing prices. The sample contains 23,781 observations. Here are descriptions of the variables:

House condition:
- total_area: the total area owned, including the living area and the outdoor area
- houseAge: the age of the house since it was first built
- interior_condition: the interior quality of the house, rated on a numeric scale

Demographic characteristics:
- MedHHInc: the median household income in the past 12 months (in 2021 inflation-adjusted dollars)
- pctForeign: the percentage of residents born outside the United States
- pctMobility: the percentage of residents who lived in a different house in the United States one year ago

Environmental impact:
- shooting.Buffer: the number of victims who suffered a gunshot wound to the head within a 1,000-foot buffer of each house
- dist_to_school: the distance from a house to the nearest school
- dist_to_commerce: the distance from a house to the nearest commercial corridor
- dist_to_PPR: the distance from a house to the nearest public park or recreation center

The table reports the maximum, minimum, mean, and standard deviation of each variable; the large standard deviations indicate substantial variability in the data. We need to check for outliers before further analysis.

## Impute missing values in numeric columns with the column mean
numeric_columns <- sapply(Philly_Housing_joined, is.numeric)
for (col in names(Philly_Housing_joined)[numeric_columns]) {
  col_mean <- mean(Philly_Housing_joined[[col]], na.rm = TRUE)
  Philly_Housing_joined[[col]][is.na(Philly_Housing_joined[[col]])] <- col_mean
}

house_condition <- st_drop_geometry(Philly_Housing_joined) %>%
  select(total_area, houseAge, interior_condition) %>%
  gather(variable, value) %>%
  group_by(variable) %>%
  summarize(
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    Min = min(value, na.rm = TRUE),
    Max = max(value, na.rm = TRUE)
  )

demo <- st_drop_geometry(Philly_Housing_joined) %>%
  select(MedHHInc, pctForeign, pctMobility) %>%
  gather(variable, value) %>%
  group_by(variable) %>%
  summarize(
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    Min = min(value, na.rm = TRUE),
    Max = max(value, na.rm = TRUE)
  )

envi_impact <- st_drop_geometry(Philly_Housing_joined) %>%
  select(shooting.Buffer, dist_to_school, dist_to_commerce, dist_to_PPR) %>%
  gather(variable, value) %>%
  group_by(variable) %>%
  summarize(
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    Min = min(value, na.rm = TRUE),
    Max = max(value, na.rm = TRUE)
  )

summary_stats <- rbind(house_condition, demo, envi_impact)  
summary_table <- knitr::kable(summary_stats, "html", digits = 1, caption = "Table 2.1 Summary Statistics") %>%
  kable_styling(full_width = F)

grouped_table <- group_rows(summary_table, "House Condition", 1, 3)
grouped_table <- group_rows(grouped_table, "Demographic Characteristics", 4, 6)
grouped_table <- group_rows(grouped_table, "Environmental Impact", 7, 10)

grouped_table
Table 2.1 Summary Statistics
variable mean sd Min Max
House Condition
houseAge 85.4 29.7 0.0 273.0
interior_condition 3.6 0.9 0.0 7.0
total_area 1828.9 3813.0 0.0 226512.0
Demographic Characteristics
MedHHInc 65581.2 31777.9 6302.0 250001.0
pctForeign 0.1 0.1 0.0 0.8
pctMobility 0.1 0.1 0.0 0.8
Environmental Impact
dist_to_PPR 1796.5 1032.0 79.9 10317.7
dist_to_commerce 670.0 638.3 0.0 7385.7
dist_to_school 1147.7 647.6 32.4 5900.8
shooting.Buffer 3.4 2.5 1.0 32.0

Correlation Matrix across variables

Figure 2.2 is a correlation matrix among the numerical variables, showing the pairwise correlation between each pair of variables on the x-axis and y-axis. The correlation coefficient indicates positive, negative, or no correlation: a positive correlation means that when one variable increases, the other also increases; a negative correlation means one decreases as the other increases; and a coefficient of 0 means that changes in one variable do not predict changes in the other. Most variable pairs have a near-zero or slightly negative correlation, but a few pairs, such as distance to school and distance to park, are strongly positively correlated.

numericVars <- st_drop_geometry(Philly_Housing_joined)[, c("total_area", "houseAge", "interior_condition", "MedHHInc", "pctForeign", "pctMobility", "shooting.Buffer", "dist_to_school", "dist_to_commerce", "dist_to_PPR")]

ggcorrplot(
  round(cor(numericVars), 1),
  p.mat = cor_pmat(numericVars),
  colors = c("#25CB10", "white", "blue"),
  type="lower",
  insig = "blank") +
    labs(title = "Correlation matrix across variables", 
         caption = "Figure 2.2"
         )

Home price correlation scatterplot with median household income

This scatterplot (Figure 2.3) shows the relationship between home price and median household income. The blue line is the line of best fit, which has a positive but gentle slope: as median household income increases, house price increases, but not by much. Points are concentrated in the bottom-left corner, meaning that lower-income households tend to live in lower-priced houses. The data are more scattered for high-income households, whose home prices vary widely along the y-axis. There are also three outliers with prices near $600 million.

ggplot(Philly_Housing_joined, aes(x = MedHHInc, y = sale_price)) +
  geom_point(size = .5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = "Price as a Function of Median Household Income",
    x = "Median Household Income",
    y = "Home Price",
    caption = "Figure 2.3"
  ) +
  theme()

Home price correlation scatterplot with house age

Figure 2.4 is a scatterplot showing the correlation between house age and house price. The best-fit line shows that house prices gently decrease as house age increases. There is also a large spread of home prices at any given house age, so we might question whether house age is a good predictor of house value.

ggplot(Philly_Housing_joined, aes(x = houseAge, y = sale_price)) +
  geom_point(size = .5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = "Price as a Function of House Age",
    x = "Age",
    y = "Home Price",
    caption = "Figure 2.4"
  ) +
  theme_minimal()

Home Price Correlation Scatterplot with total areas

Figure 2.5 shows the correlation between the total area of a house and its price. There is a strong positive relationship, as the slope of the best-fit line is greater than in the previous two plots: house prices increase with total area. However, prices still vary widely when total areas are relatively small; for example, there are outliers where houses with small areas have significantly high prices.

ggplot(Philly_Housing_joined, aes(x = total_area, y = sale_price)) +
  geom_point(size = .5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = "Price as a Function of Total Areas",
    x = "total areas",
    y = "Home Price",
    caption = "Figure 2.5"
  ) +
  theme_minimal()

Home Price Correlation with Distance to Parks and Recreation

Figure 2.6 shows the correlation between the distance to the nearest park or recreation center and house price. The best-fit line is nearly horizontal, meaning that house prices are not strongly affected by how far these distances change.

ggplot(Philly_Housing_joined, aes(x = dist_to_PPR, y = sale_price)) +
  geom_point(size = .5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = "Price as a Function of Distance to Parks and Recreation",
    x = "Distance to Parks and Recreation",
    y = "Home Price",
    caption = "Figure 2.6"
  ) +
  theme_minimal()

Map of dependent variable (sale price)

Figure 2.7 is a choropleth map of the sale price distribution by block group in Philadelphia. Housing values are grouped into equal intervals, with lighter blue representing lower sale prices and darker blue representing higher ones. Most block groups have relatively low prices, below $1,000,000. Prices are higher in the northwestern block groups as well as in the downtown area.

Philly_price_Model <- Philly_Housing_joined %>%
  filter(toPredict == "MODELLING")

mean_sale_price <- Philly_price_Model %>%
  group_by(GEOID) %>%
  summarize(MeanSalePrice = mean(sale_price))

## Attribute join by GEOID (st_join is a spatial join and ignores `by`)
blockgroup21 <- left_join(blockgroup21, st_drop_geometry(mean_sale_price), by = "GEOID")
blockgroup21$MeanSalePrice[is.na(blockgroup21$MeanSalePrice)] <- 0
ggplot() +
  geom_sf(data = blockgroup21, aes(fill = MeanSalePrice)) +
  scale_fill_gradient(low = "lightblue", high = "darkblue", 
                      name = "Sale Price (U.S. Dollars)") +
  labs(title = "Choropleth Map of Housing Sale Price by Block Groups",
       caption = "Figure 2.7") +
  theme_minimal()

Geographic mobility in block groups

The map of geographic mobility by block group (Figure 2.8) shows a scattered pattern: there is no cluster of adjacent block groups with the highest or lowest mobility rates. However, one block group in dark blue has a notably high mobility rate.

ggplot() +
  geom_sf(data = blockgroup21, aes(fill = pctMobility)) +
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Geographic Mobility Rate") +
  labs(title = "Choropleth Map of Geographic Mobility by Block Groups",
       caption = "Figure 2.8") +
  theme_minimal()

Nearest school distance

The map (Figure 2.9) shows the average distance to the nearest school by block group. South Philadelphia clearly has shorter average distances, while most blocks in North Philadelphia have longer ones. Block groups in downtown Philadelphia have similar distances of 2,000 to 3,000 feet.

mean_school_dis <- Philly_Housing_joined %>%
  group_by(GEOID) %>%
  summarize(MeanSchoolDis = mean(dist_to_school))

## Attribute join by GEOID (st_join is a spatial join and ignores `by`)
blockgroup21 <- left_join(blockgroup21, st_drop_geometry(mean_school_dis), by = "GEOID")
blockgroup21$MeanSchoolDis[is.na(blockgroup21$MeanSchoolDis)] <- 0
ggplot() +
  geom_sf(data = blockgroup21, aes(fill = MeanSchoolDis)) +
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Average Distance") +
  labs(title = "Choropleth Map of Distance to School by Block Groups",
       caption = "Figure 2.9") +
  theme_minimal()

Shooting Victims with Head Injuries

The heat map (Figure 2.10) shows the density of victims who suffered gunshot wounds to the head. The incidents are clearly concentrated in downtown and West Philadelphia.

ggplot() + geom_sf(data = blockgroup21, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(PhillyShootingHead.sf)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "lightblue", high = "purple", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density Map of Shooting Wound in Head, Philadelphia",
       caption = "Figure 2.10") +
  mapTheme()

Distance to commercial corridors by block groups

The map (Figure 2.11) shows the average distance to the nearest commercial corridor by block group. Suburban Philadelphia, especially the far Northeast, clearly has longer average distances to commercial corridors. Block groups in Center City, West, and Southeast Philadelphia have similar distances of less than 2,000 feet.

mean_CC_dis <- Philly_Housing_joined %>%
  group_by(GEOID) %>%
  summarize(MeanCCDis = mean(dist_to_commerce))

## Attribute join by GEOID (st_join is a spatial join and ignores `by`)
blockgroup21 <- left_join(blockgroup21, st_drop_geometry(mean_CC_dis), by = "GEOID")
blockgroup21$MeanCCDis[is.na(blockgroup21$MeanCCDis)] <- 0
ggplot() +
  geom_sf(data = blockgroup21, aes(fill = MeanCCDis)) +
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Average Distance to Commercial Corridors") +
  labs(title = "Choropleth Map of Distance to Commercial Corridors by Block Groups",
       caption = "Figure 2.11") +
  theme_minimal()

III. Methods

3.1 Coding techniques:

We performed our OLS regression analysis in the R programming language. Packages such as kableExtra, ggplot2, and stargazer helped us create data visualizations and statistical summaries. We combined our code and report into an R Markdown file, which can be knitted into an HTML file.

3.2 Data Analysis:

Data integration:

In Part II, we introduced our datasets and provided a general summary of the statistics of our predictors. The correlation matrix, scatterplots, and geospatial visualizations create a multidimensional way to understand the predictors. A comprehensive table of summary statistics provides insight into the distribution of all variables in the dataset. These variables were sorted into categories for easy reference: internal characteristics, amenities/public services, and spatial structure.

OLS regression analysis

First, we separated the data containing actual sale prices from the data to be predicted. We split the modeling data into training and test sets in a 3:2 proportion. The training data allow the algorithm to learn a model with minimum error, while the test data are used to evaluate the model’s performance in predicting house values. We then fit a linear regression model using the lm function in R; the values in the model are interpreted in the Results section. The mean absolute error is also calculated on the test set to measure the model’s accuracy.

Second, we used 100-fold cross-validation to assess the model’s performance. It splits the modeling data randomly into 100 equal folds and repeatedly trains on k-1 of the folds, validating on the held-out fold. To interpret the mean absolute error and the correlation between the predicted results and the actual sale prices more visually, we created a histogram, a scatterplot, and a map of the error distribution.

Third, we ran a Moran’s I test with the moran.mc function, using a spatial weights matrix built from each house’s 5 nearest neighbors, to test for spatial autocorrelation and check for clustering of house prices. A Moran’s I value near 1 or -1 indicates strong clustering or dispersion, while a value near 0 indicates a random pattern. The plot shows the observed Moran’s I against the frequency distribution of 999 randomly permuted I values. A scatterplot of the spatial lag of errors against prediction errors is also displayed, showing whether spatial autocorrelation exists in the errors. With the above model, we fed the unpredicted data into the regression and mapped the results.

The last part of the analysis tests the generalizability of the neighborhood model, that is, the accuracy of the prediction across space. To achieve this, we first ran another regression for each neighborhood and created a map to visualize the mean absolute percent errors by neighborhood. We also used Census data on African American residents to categorize the neighborhoods and made tables comparing the mean absolute percent errors.
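The neighborhood-level error summary described above can be sketched as follows, assuming the test-set errors computed in the Results section and a neighborhood name column (here called mapname, an assumed column name from the neighborhood join):

## MAPE by neighborhood (a sketch; `mapname` is an assumed column name)
nhood_MAPE <- st_drop_geometry(PhillyPrice.test) %>%
  group_by(mapname) %>%
  summarize(meanMAPE = mean(sale_price.APE, na.rm = TRUE))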

IV. Results

Training set linear model summary results

Table 3.1 summarizes the results of the linear model from the OLS regression. There are 14,272 observations in the training dataset. Our dependent variable is the house sale price, and each variable below is a predictor of it. The constant is 289,866, the predicted house price when every variable equals 0. The number in the second column of each variable row is the coefficient; its sign determines whether the sale price increases or decreases when the variable increases by one unit. For example, as house age increases by one year, the sale price decreases by $21.654. The number in parentheses is the standard error of each coefficient, and a smaller error indicates a more reliable coefficient. The total area of the house has a standard error of 0.452, while that of the mobility percentage is 15,937.41, which means the coefficient of total area is estimated more precisely than that of the mobility percentage.

The stars at the upper right of each coefficient represent the statistical significance level, measured by the p-value: the probability, under the null hypothesis of no relationship between the house sale price and the predictor, of observing a coefficient at least this extreme. Total area, interior condition, shooting victims, distance to commercial corridors, distance to PPR, median household income, foreign-born percentage, and mobility rate all have p-values less than 0.01, meaning these variables play a significant role in predicting the sale price. House age has a p-value larger than 0.1, and distance to school has a p-value just below 0.1; these two are therefore weaker predictors of housing prices.

The R-squared and adjusted R-squared are 0.360 and 0.359, measures of how well the model fits the data: the predictors explain about 36% of the variability in house prices. This is an intermediate fit for a regression model.

The F-statistic tests hypotheses about the variance explained by the model. Its value here is 800.92, with a p-value smaller than 0.01, indicating that the overall model is statistically significant and at least one predictor is significantly related to housing prices.

## Separate the dataset into training and test data
inTrain <- createDataPartition(
              y = paste(Philly_price_Model$interior_condition), 
              p = .60, list = FALSE)
PhillyPrice.training <- Philly_price_Model[inTrain,] 
PhillyPrice.test <- Philly_price_Model[-inTrain,]  

## Create the linear regression model on the training data
reg.training <-
  lm(sale_price ~ ., data = as.data.frame(PhillyPrice.training) %>%
                             dplyr::select(sale_price, total_area, interior_condition, houseAge, shooting.Buffer, dist_to_school, dist_to_commerce, dist_to_PPR, MedHHInc, pctForeign, pctMobility))

# Create a table of the model summary using stargazer
stargazer(reg.training,
          title = "Table 3.1. Linear Regression Model Summary",
          type = "text",
          align = TRUE
          )
## 
## Table 3.1. Linear Regression Model Summary
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                             sale_price         
## -----------------------------------------------
## total_area                   14.199***         
##                               (0.452)          
##                                                
## interior_condition        -69,936.980***       
##                             (2,021.539)        
##                                                
## houseAge                      -21.654          
##                              (61.099)          
##                                                
## shooting.Buffer            -5,484.062***       
##                              (665.364)         
##                                                
## dist_to_school                5.254*           
##                               (2.823)          
##                                                
## dist_to_commerce            -27.518***         
##                               (2.724)          
##                                                
## dist_to_PPR                  7.560***          
##                               (1.747)          
##                                                
## MedHHInc                     2.929***          
##                               (0.054)          
##                                                
## pctForeign                 77,992.980***       
##                            (13,512.380)        
##                                                
## pctMobility               243,826.800***       
##                            (15,937.410)        
##                                                
## Constant                  289,866.000***       
##                            (10,931.520)        
##                                                
## -----------------------------------------------
## Observations                  14,272           
## R2                             0.360           
## Adjusted R2                    0.359           
## Residual Std. Error  195,617.500 (df = 14261)  
## F Statistic         800.920*** (df = 10; 14261)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Table of mean absolute error and MAPE for a single test set

Table 3.2 shows the mean absolute error (MAE) and mean absolute percent error (MAPE). The absolute error is the difference between the observed and predicted price for each observation, and the absolute percent error expresses it on a percentage basis. The mean of each gives an overview of the errors in our test set.

The mean absolute error is 104,648.1, meaning the predicted prices are, on average, $104,648.1 above or below the actual prices, and the error is about 37% of the predicted price. Judging by the MAE and MAPE, the model is not problematic, but its fit is not high either, so we need a further validation approach.

## Calculate the error of the predicted price values
PhillyPrice.test <-
  PhillyPrice.test %>%
  mutate(sale_price.Predict = predict(reg.training, PhillyPrice.test),
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict) %>%
  filter(sale_price < 5000000)

## Create the summary table
error_summary <- st_drop_geometry(PhillyPrice.test) %>%
  summarize(
    MAE = mean(sale_price.AbsError),
    MAPE = mean(sale_price.APE)
  )

kable(error_summary, "html", caption = "Table 3.2 Error Summary for Test Set", digits = 2) %>%
  kable_styling()
Table 3.2 Error Summary for Test Set
MAE MAPE
104648.1 0.37

Cross validation tests with 100 folds

Below are the 100-fold cross-validation results, which calculate the average goodness of fit across 100 equal-sized subsets. The root mean square error (RMSE) is 180,808.5, the square root of the mean squared error. This value is not low, so our model cannot provide the most accurate predictions. The mean and standard deviation of the mean absolute error are 103,295.6 and 11,210.03, indicating variation of about 11,210 in MAE across folds. To visualize the pattern and distribution of the MAE more clearly, Figure 3.3 is a histogram showing the frequency of each fold’s MAE in $1,000 bins. The frequency is highest in the range of 90,000 to 110,000, but the errors are not tightly clustered; therefore the model is not very generalizable to new data.

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
## create cross-validating
reg.cv <- 
  train(sale_price ~ ., data = st_drop_geometry(Philly_price_Model) %>% 
                                dplyr::select(sale_price, 
                                total_area, interior_condition, 
                                houseAge, shooting.Buffer, dist_to_school, 
                                dist_to_commerce, dist_to_PPR, MedHHInc, 
                                pctForeign, pctMobility), 
     method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv
## Linear Regression 
## 
## 23781 samples
##    10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 23542, 23542, 23544, 23544, 23544, 23542, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   180808.5  0.4266118  103295.6
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
#mean and standard deviation MAE
mean(reg.cv$resample[,3])
## [1] 103295.6
sd(reg.cv$resample[,3])
## [1] 11210.03
## Extract the fold labels and MAE values from the resampling results
cv_results <- data.frame(Resample = reg.cv$resample$Resample,
                         MAE = reg.cv$resample$MAE)

# Create a histogram of MAE values
ggplot(cv_results, aes(x = MAE)) +
  geom_histogram(binwidth = 1000, fill = "blue", color = "black") +
  labs(title = "Cross-Validation Mean Absolute Error (MAE) Histogram",
       x = "Mean Absolute Error (MAE)",
       y = "Frequency",
       caption = "Figure 3.3")

Plot of predicted prices as a function of observed prices

Figure 3.4 plots predicted prices against observed prices. The orange line is the perfect-fit line, with a slope of 1, meaning the predicted prices exactly match the observed sale prices. The blue line is the fitted regression of observed on predicted prices. The two lines are close but do not overlap, indicating that the model fits well but not perfectly. Most points are distributed along the perfect-fit line, but there are still many points where the actual sale price is much higher than the predicted price.

scatter_plot <- ggplot(data = PhillyPrice.test, aes(x = sale_price.Predict, y = sale_price)) +
  geom_point() +  # Add scatter points
  geom_abline(intercept = 0, slope = 1, color = "orange", linetype = "dashed") +  # Perfect fit line
  geom_smooth(method = "lm", se = FALSE, color = "blue", linetype = "dashed")+
  labs(title = "Scatter Plot of SalePrice vs. SalePrice.Predict",
       x = "SalePrice.Predict",
       y = "SalePrice",
       caption = "Figure 3.4")+
  scale_y_continuous(limits = c(0, max(PhillyPrice.test$sale_price)), expand = c(0,0))

print(scatter_plot)

A map of residuals for the test set.

Figure 3.5 visualizes the differences between the predicted prices and the observed prices geospatially. The largest residuals, shown in cyan-blue, are distributed widely across the city but are more clustered in the downtown area. The smallest residuals, shown in brown, are clustered in the northwest and the south of Philadelphia.

palette5 <- c("#a6611a", "#dfc27d", "#f5f5f5",   "#80cdc1", "#018571")
ggplot() +
  geom_sf(data = blockgroup21, fill = "grey40") +
  geom_sf(data = PhillyPrice.test, aes(colour = q5(sale_price.Error)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   name="Quintile\nBreaks") +
  labs(title="Map of Residuals of test set",
       caption = "Figure 3.5") +
  mapTheme()

Moran’s I test results

Figure 3.6 shows the Moran's I permutation histogram together with the observed Moran's I value. The orange line marks the observed Moran's I of 0.35, which indicates moderate positive spatial autocorrelation in the model errors: areas with large errors are somewhat likely to neighbor other areas with large errors. The p-value of 0.001 indicates statistical significance, as the observed pattern is more clustered than all 999 random permutations. The histogram shows the frequency distribution of the permuted Moran's I values, which peaks around 0.

## create a neighbor list with 5 nearest neighbors for each house
coords <- st_coordinates(Philly_price_Model) 
neighborList <- knn2nb(knearneigh(coords, 5))
## create a spatial weights matrix to relate sale price with the neighbors
spatialWeights <- nb2listw(neighborList, style="W")
## Calculate the spatial lag of price
Philly_price_Model$lagPrice <- lag.listw(spatialWeights, Philly_price_Model$sale_price)

coords.test <-  st_coordinates(PhillyPrice.test) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")

PhillyPrice.test <- PhillyPrice.test %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error))

moranTest <- moran.mc(PhillyPrice.test$sale_price.Error, 
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count",
       caption = "Figure 3.6") +
  plotTheme()

Scatter plot of error as a function of the spatial lag of price

Figure 3.7 shows the relationship between the spatial lag of price errors and the sale price error in a scatter plot. The spatial lag of price errors is the weighted mean of the errors of each house's 5 nearest neighbors. Points cluster along the regression line, which has a positive but modest slope, indicating spatial autocorrelation in the model errors: as a house's price error increases, the errors of its nearby neighbors increase as well.

ggplot(PhillyPrice.test, aes(x = lagPriceError, y = sale_price.Error)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "orange")+
  labs(title = "Error as a function of the spatial lag of errors",
       x = "Spatial Lag Price Error",
       y = "Sale price error",
       caption = "Figure 3.7")

A map of predicted values for both the modeling dataset and the properties outside it.

Figure 3.8 maps predicted house prices for every property in our dataset, including those outside the modeling sample, based on our regression model. The map shows a clustered pattern of higher and lower prices: houses in northeast, northwest, and central Philadelphia have higher predicted prices, while houses in north-central and southwest Philadelphia have lower predicted prices.

Philly_Housing_joined <-
  Philly_Housing_joined %>%
  mutate(sale_price.Predict = predict(reg.training, Philly_Housing_joined))

ggplot() +
  geom_sf(data = neighborhood, fill = "grey40") +
  geom_sf(data = Philly_Housing_joined, aes(colour = q5(sale_price.Predict)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   name="Quintile\nBreaks") +
  labs(title="Map of predicted values",
       caption = "Figure 3.8") +
  mapTheme()

A map of mean absolute percentage error (MAPE) by neighborhood.

The following results account for the spatial process at the neighborhood scale by adding a neighborhood fixed-effect feature to the regression. Figure 3.9 maps the Mean Absolute Percentage Error of the test set by neighborhood for both regressions. Most neighborhoods have a relatively low MAPE under either model, but several neighborhoods in central and west Philadelphia show a high MAPE under the baseline regression. The two maps differ little apart from a single neighborhood in central Philadelphia, but the range of MAPE values in the neighborhood-effects model is narrower, suggesting that the neighborhood-effects model may be more generalizable, with more consistent accuracy across neighborhoods than the baseline regression.

reg.nhood <- lm(sale_price ~ ., data = as.data.frame(PhillyPrice.training) %>% 
                                 dplyr::select(sale_price, name,  # neighborhood name as fixed effect
                                               total_area, interior_condition, houseAge, shooting.Buffer, dist_to_school, dist_to_commerce, dist_to_PPR, MedHHInc, pctForeign, pctMobility))

PhillyPrice.test.nhood <-
  PhillyPrice.test %>%
  mutate(Regression = "Neighborhood Effects",  
         sale_price.Predict = predict(reg.nhood, PhillyPrice.test),
         sale_price.Error = sale_price.Predict- sale_price,
         sale_price.AbsError = abs(sale_price.Predict- sale_price),
         sale_price.APE = (abs(sale_price.Predict- sale_price)) / sale_price) %>%
  filter(sale_price < 5000000) 

mape_neighborhood <- PhillyPrice.test.nhood %>%
  group_by(name) %>%
  summarize(MAPE = mean(sale_price.APE, na.rm = TRUE))

PhillyPrice.test <- PhillyPrice.test %>%
  mutate(Regression = "Baseline Regression")

bothRegressions <-
  rbind(
    dplyr::select(PhillyPrice.test, starts_with("sale_price"), Regression, name),
    dplyr::select(PhillyPrice.test.nhood, starts_with("sale_price"), Regression, name))

st_drop_geometry(bothRegressions) %>%
  group_by(Regression, name) %>%
  summarize(MAPE = mean(sale_price.APE, na.rm = T)) %>%
  ungroup() %>% 
  left_join(neighborhood) %>%
    st_sf() %>%
    ggplot() + 
      geom_sf(aes(fill = MAPE)) +
      facet_wrap(~Regression) +
      scale_fill_gradient(low = palette5[1], high = palette5[5],
                          name = "MAPE") +
      labs(title = "Mean test set MAPE by neighborhood",
           caption = "Figure 3.9") +
      mapTheme()

A scatter plot of MAPE by neighborhood as a function of mean price by neighborhood.

Figure 3.10 shows the Mean Absolute Percentage Error by neighborhood as a function of mean price by neighborhood. The regression line has a negative slope, meaning that as a neighborhood's mean price increases, its mean absolute percentage error decreases and predictions become more accurate. However, the points are widely dispersed around the line, so the relationship is weak.

mean_price_summary <- PhillyPrice.test.nhood %>%
  group_by(name) %>%
  summarize(meanPrice = mean(sale_price, na.rm = TRUE))

# attribute joins on neighborhood name (st_join performs a spatial join, not a key join,
# so we drop geometry from the summaries and use left_join instead)
neighborhood <- left_join(neighborhood, st_drop_geometry(mean_price_summary), by = "name")
neighborhood <- left_join(neighborhood, st_drop_geometry(mape_neighborhood), by = "name")

ggplot(data = neighborhood, aes(x = meanPrice, y = MAPE)) +
  geom_point(alpha = 0.7) +  # semi-transparent points to reduce overplotting
  geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") +
  labs(title = "MAPE by Neighborhood as a Function of Mean Price",
       caption = "Figure 3.10") +
  theme()

Generalizability across urban contexts

We chose race, specifically African-American population share, as the urban context for testing generalizability. Figure 3.11 shows the distribution of majority African-American and majority non-African-American block groups. North Philadelphia is more segregated, with more majority African-American block groups, while northwest Philadelphia is more segregated toward non-African-American block groups; central Philadelphia is integrated between the two. Table 3.12 reports the mean absolute percentage error of the baseline and neighborhood-effects regressions by racial context. The gap between the two contexts is 8 percentage points for the baseline regression but grows to 11 percentage points under the neighborhood-effects model. Although neither gap is dramatic, the neighborhood effects make the model slightly less generalizable across racial contexts.

blockgroup21 <- blockgroup21 %>%
  # assumes a pctAfricanAmerican share computed from census data; a block group is
  # labeled "Majority African-American" when that share exceeds one half
  mutate(AfricanContext = ifelse(pctAfricanAmerican > .5, "Majority African-American", "Majority non African-American"))
  
grid.arrange(ncol = 2,
  ggplot() + 
    geom_sf(data = na.omit(blockgroup21), aes(fill = AfricanContext)) +
    scale_fill_manual(values = c("#a6611a", "#018571"), name="African-American Context") +
    labs(title = "African-American Context",
         caption = "Figure 3.11") +
    mapTheme() + theme(legend.position="bottom"))


st_join(bothRegressions, blockgroup21) %>% 
  group_by(Regression, AfricanContext) %>%
  summarize(mean.MAPE = scales::percent(mean(sale_price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(AfricanContext, mean.MAPE) %>%
  kable("html", caption = "Table 3.12 Test set MAPE by neighborhood racial context") %>%
  kable_styling(full_width = F)
Table 3.12 Test set MAPE by neighborhood racial context
Regression            Majority African-American   Majority non African-American
Baseline Regression   41%                         33%
Neighborhood Effects  57%                         68%

V. Discussion

Overall, the model is not ideally effective. Results indicate that the regression explains approximately 36% of the variability in housing prices. While not a high R2, this suggests the model has some predictive power, but it does not capture all factors influencing housing prices, and a significant portion of the price variation remains unexplained. The F-statistic of 800.92, with an overall p-value less than 0.01, indicates that the model is statistically significant rather than random: at least one predictor is significantly related to housing price. However, the MAE is as high as $104,648.1 and the MAPE is 37%. These metrics indicate that, on average, predicted prices are off by approximately $104,648, or 37% of the actual price. Cross-validation revealed a high mean Root Mean Square Error (RMSE) of 180,808.5, which indicates that our model cannot provide highly accurate predictions of housing prices. While the model is not fundamentally flawed, it does not provide high accuracy and produces substantial errors. Considering all of these measures of effectiveness, we regard our model as moderately effective: it has some power to predict house prices but falls short of explaining a significant portion of the price variation, and the MAE and MAPE suggest there is room for improvement in prediction accuracy.
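To make these accuracy metrics concrete, the sketch below computes MAE, RMSE, and MAPE from a small set of hypothetical observed and predicted prices (illustrative numbers only, not our results):

```r
# Hypothetical observed and predicted sale prices (illustrative only)
observed  <- c(200000, 350000, 150000, 500000)
predicted <- c(180000, 400000, 160000, 450000)

error <- predicted - observed
MAE  <- mean(abs(error))             # average dollar error
RMSE <- sqrt(mean(error^2))          # penalizes large errors more heavily
MAPE <- mean(abs(error) / observed)  # average error as a share of the price

c(MAE = MAE, RMSE = RMSE, MAPE = MAPE)
```

Applied to sale_price and sale_price.Predict on our test set, these same formulas yield the MAE and MAPE reported above.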

Some variables, such as total area, interior condition, shooting incidents, and median household income, have low p-values, indicating their significance in predicting housing prices. In contrast, house age and distance to schools are less significant and did not perform as strongly as we expected before the analysis. We therefore consider housing prices to be less influenced by a house's age since construction and its distance to the nearest school. Two interesting variables are the percentage of foreign-born residents and distance to commercial corridors. The percentage of foreign-born residents is a highly significant predictor with a substantial positive coefficient, suggesting that a higher share of foreign-born residents is associated with higher house prices. It is surprising that immigration is related to higher housing prices, the opposite of the assumption we made when selecting this variable. Distance to commercial corridors is a significant predictor whose sign implies that being closer to commercial areas may be associated with lower house prices. This result was also unexpected, because we assumed that buyers would prefer the convenience of housing near commercial corridors, which would push prices up.
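As a sketch of how predictor significance is read off an OLS fit, the toy example below (synthetic data, not our Philadelphia model) fits lm() and extracts the coefficient p-values; a variable with no true effect, like the simulated houseAge here, typically shows a large p-value:

```r
# Synthetic data: price depends on total_area but not on houseAge
set.seed(1)
toy <- data.frame(
  total_area = runif(100, 500, 3000),
  houseAge   = runif(100, 0, 100)
)
toy$sale_price <- 100 * toy$total_area + rnorm(100, sd = 20000)

fit <- lm(sale_price ~ total_area + houseAge, data = toy)
coef(summary(fit))[, "Pr(>|t|)"]  # one p-value per coefficient
```

The same Pr(>|t|) column in our model's summary is what the significance statements above refer to.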

The maps show some evidence of spatial variation in prices. Most neighborhoods have housing prices under $1 million, while higher prices are clustered in the northwest block groups and the downtown area. The spatial distribution of the residuals and the Moran's I test results suggest moderate positive spatial autocorrelation, meaning that areas with high house prices tend to be clustered near other high-priced areas. However, this spatial variation is not uniform across all neighborhoods.

The scatterplot of predicted prices as a function of observed prices suggests that the model performs reasonably. The observed-price and predicted-price fit lines are in close proximity, indicating that the model is not an exact fit but is still a reasonably good one. Most data points align closely with the ideal-fit line, but a substantial number of points have observed sale prices that significantly exceed the predicted prices. This may be partly data related, as the quality and quantity of data available for some neighborhoods is deficient. The major cause, however, is likely factors that affect house prices but were not taken into consideration when building the model.

VI. Conclusion

In conclusion, we achieved our primary goal of understanding the factors influencing housing prices in Philadelphia, using data from OpenDataPhilly and a range of analyses including maps, charts, and regressions. The results of our analysis were tested for validation, accuracy, and generalizability. Our OLS regression model drew on various independent variables that might impact housing prices, including measures of interior house condition, demographic characteristics, and the exterior environment. Notably, we identified high variability in some variables, as evidenced by their standard deviations.

In summary, our analysis provides insight into the factors influencing housing prices in Philadelphia and highlights the strengths and limitations of our predictive models. It may not be an optimal model for Zillow's price predictions, so further research and refinement may be necessary to achieve higher accuracy and predictive power. First, we should add and refine predictors: school ratings or school type might be more closely related to housing prices than distance to school. The shooting data, limited to victims shot in the head, could also be expanded to all shooting victims. It is also essential to examine the distribution of each variable; a transformation such as a log transform should be considered for variables that are not normally distributed. We could further improve the model by collecting more data and performing a longitudinal analysis across years, since the trend of house prices within the same neighborhood over time is important to consider. Finally, we detected some negative predicted values in our results, which do not align with the real world; we should find a better way to handle outliers or filter out these values before predicting.
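The log transformation suggested above can be sketched as follows: fitting the regression on log(sale_price) and exponentiating the predictions guarantees strictly positive predicted prices (toy data and coefficients, not our model):

```r
# Toy data with right-skewed prices (illustrative only)
set.seed(2)
toy <- data.frame(total_area = runif(200, 500, 3000))
toy$sale_price <- exp(11 + 5e-4 * toy$total_area + rnorm(200, sd = 0.3))

fit.log <- lm(log(sale_price) ~ total_area, data = toy)
pred <- exp(predict(fit.log, toy))  # back-transform from log scale to dollars

all(pred > 0)  # negative predicted prices are impossible on this scale
```

This would address the negative predicted values we observed, at the cost of interpreting coefficients as approximate percentage effects rather than dollar effects.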

VII. References

Datasets:
- OpenDataPhilly: https://opendataphilly.org/
- United States Census Bureau: https://data.census.gov/

Articles:
- Steif, Ken (2021). Public Policy Analytics: Code & Context for Data Science in Government.

Posts:
- MUSA 508 ED Discussion Board